In [1]:
import pandas as pd
bike_rentals = pd.read_csv("bike_rental_hour.csv")
print(bike_rentals.head())
   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  \
0        1  2011-01-01       1   0     1   0        0        6           0   
1        2  2011-01-01       1   0     1   1        0        6           0   
2        3  2011-01-01       1   0     1   2        0        6           0   
3        4  2011-01-01       1   0     1   3        0        6           0   
4        5  2011-01-01       1   0     1   4        0        6           0   

   weathersit  temp   atemp   hum  windspeed  casual  registered  cnt  
0           1  0.24  0.2879  0.81        0.0       3          13   16  
1           1  0.22  0.2727  0.80        0.0       8          32   40  
2           1  0.22  0.2727  0.80        0.0       5          27   32  
3           1  0.24  0.2879  0.75        0.0       3          10   13  
4           1  0.24  0.2879  0.75        0.0       0           1    1  
Each row represents one hour of rental data. Our target column is "cnt", the total number of bikes rented during that hour.
In [2]:
# Plotting "cnt" column
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(bike_rentals["cnt"])
Out[2]:
(array([ 6972.,  3705.,  2659.,  1660.,   987.,   663.,   369.,   188.,
          139.,    37.]),
 array([   1. ,   98.6,  196.2,  293.8,  391.4,  489. ,  586.6,  684.2,
         781.8,  879.4,  977. ]),
 <a list of 10 Patch objects>)
In [3]:
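# Printing out how each numeric column correlates with the "cnt" column.
# "dteday" holds date strings, so only numeric columns are kept; recent pandas versions raise an error otherwise.
bike_rentals.select_dtypes(include="number").corr()["cnt"]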
Out[3]:
instant       0.278379
season        0.178056
yr            0.250495
mnth          0.120638
hr            0.394071
holiday      -0.030927
weekday       0.026900
workingday    0.030284
weathersit   -0.142426
temp          0.404772
atemp         0.400929
hum          -0.322911
windspeed     0.093234
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64
In [4]:
# Creating "time_label" column, which groups hours into parts of the day so the algorithm can treat nearby hours as related.
# 1 = morning (6-11), 2 = afternoon (12-17), 3 = evening (18-23), 4 = night (0-5)
def assign_label(hr):
    if 6 <= hr < 12:
        return 1
    elif 12 <= hr < 18:
        return 2
    elif 18 <= hr < 24:
        return 3
    elif 0 <= hr < 6:
        return 4
In [5]:
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
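
The same hour-to-label mapping can also be written without a Python-level function. The following is a minimal sketch, assuming `pd.cut` with left-closed bins as an alternative to `apply`; the names `hours` and `labels` are illustrative and not part of the notebook.

```python
import pandas as pd

# Stand-in for bike_rentals["hr"]: every hour of the day, 0 through 23.
hours = pd.Series(range(24))

# Left-closed bins [0, 6), [6, 12), [12, 18), [18, 24) mapped to the
# labels used above: night=4, morning=1, afternoon=2, evening=3.
labels = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=[4, 1, 2, 3],
).astype(int)

# Each label should cover exactly six hours.
print(labels.value_counts().sort_index())
```

On the real data, `hours` would be replaced by `bike_rentals["hr"]`.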

Error Metric:

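Our target, "cnt", is a continuous numeric value, so Mean Squared Error (MSE) is a suitable error metric. MSE averages the squared differences between predicted and actual values, so large errors are penalized more heavily than small ones.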
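
To make the metric concrete, here is a small sketch that computes MSE by hand. The `actual` values are the first five "cnt" values printed by `head()`; the `predicted` values and the helper name `manual_mse` are made up for illustration.

```python
def manual_mse(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

actual = [16, 40, 32, 13, 1]      # first five "cnt" values from head()
predicted = [20, 35, 30, 15, 5]   # hypothetical predictions

# Squared errors are 16, 25, 4, 4, 16, which sum to 65.
mse_example = manual_mse(actual, predicted)
print(mse_example)  # 13.0
```

Taking the square root of an MSE puts the error back on the scale of the target, in rentals per hour.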

In [6]:
# Splitting dataframe into train (80%) and test (20%) sets.
train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
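
The split above relies on the index: rows sampled into `train` are excluded from `test` by index label. A quick sanity check of that pattern, sketched here on a small made-up DataFrame (`toy` is not part of the notebook), confirms the two sets do not overlap and together cover every row.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for bike_rentals.
toy = pd.DataFrame({"hr": range(10), "cnt": range(10, 20)})

# Same pattern as the notebook: sample 80%, keep the remainder as test.
toy_train = toy.sample(frac=0.8, random_state=1)
toy_test = toy.loc[~toy.index.isin(toy_train.index)]

# The two sets share no rows and together account for every row.
no_overlap = toy_train.index.intersection(toy_test.index).empty
full_coverage = len(toy_train) + len(toy_test) == len(toy)
print(no_overlap, full_coverage)
```

This check assumes the index labels are unique, as they are with the default integer index from `read_csv`.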
In [7]:
# Selecting columns to use in algorithm
cols = ["season", "yr", "mnth", "hr", "time_label", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"]
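
The column list leaves out "casual" and "registered", the two columns most strongly correlated with "cnt". In the five rows printed by `head()`, these two columns add up to "cnt", which suggests they would leak the target into the features. This is an observation about the printed rows rather than something the notebook states; the sketch below rebuilds those rows by hand to check it.

```python
import pandas as pd

# The casual, registered and cnt values copied from the head() output.
head_rows = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# True if casual + registered equals cnt in every row.
sums_match = (head_rows["casual"] + head_rows["registered"] == head_rows["cnt"]).all()
print(sums_match)  # True
```

Running the same comparison on the full `bike_rentals` frame would show whether the relationship holds for every hour.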
In [8]:
# Training and testing a Linear Regression model, and then determining error metric.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[cols], train["cnt"])
predictions = lr.predict(test[cols])
In [9]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test["cnt"], predictions)
print(mse)
17054.9594635
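An MSE of about 17,055 corresponds to a root-mean-squared error of roughly 131 rentals per hour. That is large given that, in the histogram above, most hours have fewer than 200 rentals, so Linear Regression is probably not our best option. Next we'll try a decision tree.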
In [10]:
# Training and testing a Decision Tree model, and then determining error metric.
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(train[cols], train["cnt"])
predictions_dtr = dtr.predict(test[cols])
In [11]:
mse_dtr = mean_squared_error(test["cnt"], predictions_dtr)
print(mse_dtr)
3342.90491945
In [12]:
# Adjusting parameters of the DecisionTreeRegressor class to minimize model error.
dtr2 = DecisionTreeRegressor(max_depth=15, min_samples_leaf=3)
dtr2.fit(train[cols], train["cnt"])
predictions_dtr2 = dtr2.predict(test[cols])
mse_dtr2 = mean_squared_error(test["cnt"], predictions_dtr2)
print(mse_dtr2)
3170.10390078
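The Decision Tree model performed much better than the Linear Regression model: MSE fell from about 17,055 to about 3,343 with default parameters, and to about 3,170 after adjusting max_depth and min_samples_leaf. Now we will try to create an even better model using Random Forest.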
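
Parameters such as max_depth and min_samples_leaf can also be searched systematically rather than adjusted by hand. Below is a minimal sketch, assuming scikit-learn's `GridSearchCV` with 5-fold cross-validation on synthetic data; the data, the grid values, and the variable names are illustrative assumptions, not the notebook's method or dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one "hour" feature and a noisy periodic target.
rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(300, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 10, size=300)

# Hypothetical grid; the values are chosen only for illustration.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 3, 10]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params, best_cv_mse)
```

Because the scores come from cross-validation on the training data, the test set stays out of parameter selection.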
In [13]:
# Training and testing a Random Forest model, and then determining error metric.
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(train[cols], train["cnt"])
predictions_rfr = rfr.predict(test[cols])
mse_rfr = mean_squared_error(test["cnt"], predictions_rfr)
print(mse_rfr)
2203.00476058
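
A random forest trains each tree on a random bootstrap sample, so without a fixed seed the MSE will differ slightly from run to run. The sketch below, on made-up data, assumes the `random_state` parameter of `RandomForestRegressor` is used to make runs repeatable; the notebook itself does not set it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data, not the bike rental dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 50 * X[:, 0] + 20 * X[:, 1] ** 2 + rng.normal(0, 1, size=200)

# Two forests built with the same seed on the same data.
forest_a = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)
forest_b = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)

# With identical seeds the two forests give identical predictions.
same_predictions = np.allclose(forest_a.predict(X), forest_b.predict(X))
print(same_predictions)
```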
In [14]:
# Adjusting parameters of the RandomForestRegressor class to minimize model error.
rfr2 = RandomForestRegressor(max_depth=17, min_samples_leaf=2)
rfr2.fit(train[cols], train["cnt"])
predictions_rfr2 = rfr2.predict(test[cols])
mse_rfr2 = mean_squared_error(test["cnt"], predictions_rfr2)
print(mse_rfr2)
2149.20005548

Random Forest models are typically among the more accurate models for this kind of prediction task. As expected, our Random Forest model with adjusted parameters performed best, with an MSE of about 2,149.
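
To compare the models on the scale of the target, the MSE values printed above can be converted to root-mean-squared errors. The sketch below only takes square roots of the numbers reported in this notebook; the dictionary keys are labels chosen for illustration.

```python
import math

# MSE values copied from the cell outputs above.
reported_mse = {
    "LinearRegression": 17054.9594635,
    "DecisionTreeRegressor (default)": 3342.90491945,
    "DecisionTreeRegressor (tuned)": 3170.10390078,
    "RandomForestRegressor (default)": 2203.00476058,
    "RandomForestRegressor (tuned)": 2149.20005548,
}

# RMSE is the square root of MSE, expressed in rentals per hour.
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE of about {value:.1f} rentals per hour")

best_model = min(rmse, key=rmse.get)
```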